| Title | Abstract | Author(s) | Date |
| --- | --- | --- | --- |
| Performance of Ability Estimation Methods for Writing Assessments under Conditions of Multidimensionality | An increasing number of large-scale assessments contain constructed response items such as essays for the advantages they offer over traditional multiple-choice measures. Writing assessments in particular often contain a mixture of multiple-choice and essay items. These mixed-format assessments pose many technical challenges for psychometricians. This study directly builds upon the Meyers et al. (2009) study by investigating how ability estimation, essay scoring approach, measurement model, and proportion of points allocated to multiple-choice items and the essay item on mixed-format assessments interact to recover ability and item parameter estimates under different degrees of multidimensionality. | Meyers, Jason L.; Turhan, Ahmet; Fitzpatrick, Steven J. | 05-2010 |
| What Item Writers Think When Writing Items: Towards a Theory of Item Writing Expertise | The study of expert item writers offers the possibility of “bottling” the knowledge and skills acquired by these experts over years of hard work. The descriptions of the identified conceptual knowledge and skills of expert item writers could be incorporated into item writing workshops in order to equip new item writers with the tools necessary to produce quality figural response items. | Fulkerson, Dennis; Nichols, Paul; Mittelholtz, David | 05-2010 |
| A Multi-level Modeling Approach to Predicting Performance on a State ELA Assessment | The purpose of this study was to examine, on a state English Language Proficiency examination for grades K-12, (a) the performance of students in low-SES vs. high-SES environments as measured by school Title I participation, (b) the performance of males vs. females, (c) the effect of ethnicity (Hispanic vs. non-Hispanic students), and (d) any interaction effects. | Brown, Raymond S.; Nguyen, T.; Stephenson, A. | 05-2010 |
| Comparisons of Test Characteristic Curve Alignment Criteria of the Anchor Set and the Total Test: Maintaining Test Scale and Impacts on Student Performance | The current paper investigates a tenet of the traditional view on the psychometric characteristics of such anchor sets. Specifically, the traditional guideline states, without any specificity, that the test characteristic curves (TCCs) of the anchor set and the total test should closely overlap. | Karkee, Thakur B., Ph.D.; Fatica, Kevin; Murphy, Stephen T., Ph.D. | 05-2010 |
| The Impact of Different Anchor Stability Methods on Equating Results and Student Performance | The key objective of this study is to demonstrate a methodological procedure or strategy for examining the different anchor stability procedures and the accompanying results and to evaluate the impact on the final RSSS tables and reported cut scores (i.e., performance levels). For our study, we did not include the bivariate plots for the old and new parameter values. | Murphy, Stephen; Little, Ian; Fan, Meichu; Lin, Chow-Hong; Kirkpatrick, Rob | 05-2010 |
| Improving the Post-Smoothing of Test Norms with Kernel Smoothing | The traditional post-smoothing methodology for developing norms used on educational and clinical products is to hand-smooth the scale scores or their distributions. This approach is very subjective, difficult to replicate, and extremely labor intensive. In hand-smoothing, the scores or distributions are adjusted based on personal judgment. Different people, or the same person at different times, will make significantly different judgments. By contrast, the kernel smoothing method is a nonparametric approach, which is more flexible, less subjective, and easier to replicate. | Lin, Anli; Yi, Qing; Young, Michael J. | 05-2010 |
| The Modified Briefing Book Standard Setting Method: Using Validity Data as a Basis for Setting Cut Scores | This paper focuses on two aspects of the modified briefing book standard setting process developed to meet this need: 1) the validity research conducted to support the standard setting; and 2) the standard setting itself, through which the validity research and associated pertinent information were organized and presented to the panelists, and the resulting process through which these data were used to elicit cut score judgments. | Miles, Julie A.; Beimers, Jennifer N.; Way, Walter D. | 05-2010 |
| Impact of Non-representative Anchor Items on Scale Stability | This study attempts to fill this gap by simulating item response data over multiple administrations under the common-item nonequivalent groups design and examining the effects of non-representative anchor items on scale stability. | Wei, Hua | 05-2010 |
| Rater Effects as a Function of Rater Training Context | This study examined the influence of rater training and scoring context on the manifestation of rater effects in a group of trained raters. | Wolfe, Edward W.; McVay, Aaron | 05-2010 |
| The Hazards of Newness: A Portrait of Challenges Faced by New High School English Teachers | This paper reports findings of a survey study designed to examine how high school English teachers are assigned to teach particular grades and track levels, whether these teachers have their own classrooms, and how they and their students perceive one another. | Bieler, Deborah; Holmes, Stephen; Wolfe, Edward W. | 05-2010 |
| IRT Proficiency Estimators and Their Impact | In the current study, we further examined the statistical properties of the various IRT estimators, especially focusing on their practical impact on the reported scores. We also investigated a few practical scenarios, where the testing focus is on assessing college readiness, assessing students’ minimal competency, or providing estimates for students who have failed a previous exam (retesters). | Tong, Ye; Kolen, Michael J. | 05-2010 |
| Correlates of Mathematics Achievement in Developed and Developing Countries: An HLM Analysis of TIMSS 2003 Eighth-grade Mathematics Scores | The purpose of this study was to investigate correlates of math achievement in both developed and developing countries. Specifically, two developed countries and two developing countries that participated in the TIMSS 2003 eighth-grade math assessment were selected for this study. For each country, contextual factors at both the student and the teacher/school levels were used to construct models that yield country-specific findings related to students’ math performance. | Phan, Ha; Sentovich, Christina; Kromrey, Jeffrey; Dedrick, Robert; Ferron, John | 05-2010 |
| The Effects of Autocorrelation on the Curve-of-Factors Growth Model | This simulation study examined the performance of the curve-of-factors model (COFM) when autocorrelation and growth processes were present in the first-level factor structure. In addition to the standard curve-of-factors growth model, two new models were examined: one COFM that included a first-order autoregressive autocorrelation parameter, and a second model that included first-order autoregressive and moving average autocorrelation parameters. | Murphy, Daniel J.; Beretvas, S. Natasha; Pituch, Keenan A. | 05-2010 |
| Distractor Rationale Taxonomy: Diagnostic Assessment of Reading with Ordered Multiple-Choice Items | The distractor rationale taxonomy (DRT) examined in this study is an understanding-level-driven distractor analysis system for multiple-choice items. The DRT purposely creates distractors at different comprehension levels to pinpoint sources of misunderstanding. | Lin, Jie; Lee Chu, Kwang; Meng, Ying | 05-2010 |
| Investigating Approaches to Estimate an Individual's Strand/objective Score Profile Reliability: A Monte Carlo Study | The paper studies the performance of generalizability and classical test theory reliability approaches for estimating the reliability of an individual's strand/objective score profile. | Arce-Ferrer, Alvaro J. | 05-2010 |
| Derivation of a Profile Reliability Index for an Individual: A Multi-Factor Congeneric Approach with Guttman Error Type Structures | The paper discusses results and proposes research to substantiate current supporting evidence for the operational use of the profile reliability approach. | Arce-Ferrer, Alvaro J. | 11-2009 |
| Growth, Precision, and CAT: An Examination of Gain Score Conditional SEM | Measurement of student growth is an important topic for K-12 state testing programs, both in terms of school accountability as well as for reporting progress of individual students. | Thompson, Tony D. | 06-2008 |
| Effects of Different Training and Scoring Approaches on Human Constructed Response Scoring | This paper summarizes and discusses research studies related to the human scoring of constructed response items that have been conducted recently at a large-scale testing company. | Nichols, Paul; Vickers, Daisy; Way, Walter D. | 04-2008 |
| Person-fit of English Language Learners (ELL) in K-12 High-Stakes Assessments | The No Child Left Behind Act holds states using federal funds accountable for student academic achievement. | Wan, Lei; Wu, Brad | 04-2008 |
| User-Centered Assessment Design | In this paper, we introduce user-centered assessment design (UCAD), an approach to test design intended to produce assessments that deliver to teachers the kind of complex information on student learning and knowledge that they can combine with sound pedagogical practice to improve student achievement. | Adams, Jeremy; Mittelholtz, David; Nichols, Paul; Van Duesen, Robert | 03-2008 |
| A Tale of Two Modes: A Case Study in User-centered Design’s Role in Comparability and Construct Validity | One merit of user-centered assessment design (UCAD) as defined by Nichols et al. (2008) is its broadened view of test development. | Strain-Seymour, Ellen, PhD | 03-2008 |
| Usability and Design Considerations for Computer-based Learning and Assessment | The overall success of computer-based products and systems is dependent to a significant extent on their usability and usefulness in the intended context. | Adams, Jeremy; Harms, Michael | 03-2008 |
| Field Testing and Equating Designs for State Educational Assessments | The educational accountability movement has spawned unprecedented numbers of new assessments. For example, the No Child Left Behind Act of 2002 (NCLB) required states to test students in grades 3 through 8 and at one grade in high school each year. | Kirkpatrick, Rob; Way, Walter D. | 03-2008 |
| An Investigation of the Changes in Item Parameter Estimates for Items Re-field Tested | Large-scale state testing programs typically rely upon a large bank of items to select from when building assessments. | Kong, Xiaojing Jadie; McClarty, Katie Larsen; Meyers, Jason L. | 03-2008 |
| A Comparison of Pre-Equating and Post-Equating Using Large-Scale Assessment Data | Equating is a statistical process that is used to adjust scores on test forms so that scores on the forms can be used interchangeably (Kolen & Brennan, 2004), even though the test forms consist of different items. | Tong, Ye; Wu, Sz-Shyan; Xu, Ming | 03-2008 |
| Maintenance of Vertical Scales | Vertical scaling refers to the process of placing scores of tests that measure similar domains but at different educational levels onto a common scale, a vertical scale. | Kolen, Michael J.; Ye, Tong | 03-2008 |
| Evidence of Test Score Use in Validity: Roles and Responsibilities | This paper has three goals. | Nichols, Paul D.; Williams, Natasha | 03-2008 |
| Score Reporting, Off-the-Shelf Assessments and NCLB: Truly an Unholy Trinity | One consequence resulting from NCLB, particularly as instructional time becomes more precious, is the desire to be more efficient in assessing learning. | Twing, Jon S., PhD | 03-2008 |
| Applying a User-Centered Design Approach to Data Management: Paper and Computer Testing | This paper discusses the application of a user-centered design (UCD) approach to a web-based application system that supports data management components of the high-stakes assessment lifecycle. | Wilson, Jeffrey R., PhD | 03-2008 |
| Exploring the Use of Item Bank Information to Improve IRT Item Parameter Estimation | On occasion, the sample of students available for calibrating a set of assessment items may not be optimal. | Ansley, Timothy; Hall, Erika | |
| A Comparison of Item and Testlet Selection Procedures in Computerized Adaptive Testing | Testlet response theory (TRT) is a measurement model that can capture local dependency in testlet-based tests. | Chen, Tzu-An Ann; Dodd, Barbara G.; Ho, Tsung-Han; Keng, Leslie | |
| Response Probability Criterion and Subgroup Performance | In the standard setting literature, there has been much debate about the most appropriate response probability (RP) to use in an item mapping procedure such as the Bookmark Standard Setting Procedure. | Egan, Karla; Mueller, Canda D.; Schneider, M. Christina | |
| A Generalization of Stratified α that Allows for Correlated Measurement Errors between Subtests | This paper presents a generalization of Stratified α that allows for correlated measurement errors between some subtest scores that make up a composite score. | Keng, Leslie; Miller, G. Edward; O'Malley, Kimberly; Turhan, Ahmet | |